{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### 0.数据收集  \n",
    "\n",
    "尽可能关联可用的数据特征，这部分主要是涉及表的关联，加工和清洗等相关操作，这部分可以参考至成老师的培训教程；这里需要特别注意的一点是数据特征(X)要先于目标(y)，比如我们要预测21年2月的某目标，我们匹配的特征数据要加上条件\"select * from xxx where month<=\"202101 ....\"\n",
    "\n",
    "### 一.合理划分训练集、验证集、测试集   \n",
    "\n",
    "#### 1.1各自的作用\n",
    "\n",
    "训练集：用于训练模型；   \n",
    "验证集：用于调参用，用于选择一组模型不错超参数，比如决策树中的最小叶子节点数、树的最大深度，随机采样率等...   \n",
    "测试集：测试模型在线上实际的效果表现   \n",
    "\n",
    "#### 1.2如何划分  \n",
    "\n",
    "例子：   \n",
    "\n",
    "比如目前收集了车险防欺诈数据总共55W条：1）其中包括20年总共有50W条，2）并在20年底产险做过一次费改业务调整，收集了21年1月的5W条数据；   \n",
    "\n",
    "目标：建模预测21年2月份的欺诈案件；   \n",
    "\n",
    "如何划分 训练集/验证集/测试集？(假设整体欺诈占比为10%)    \n",
    "\n",
    "||训练集|验证集|测试集|\n",
    "|---|---|---|---|\n",
    "|第1组|30w(20)+3w(21)|10w(20)+1w(21)|10w(20)+1w(21)|\n",
    "|第2组|40w(20)|10w(20)|5w(21)|\n",
    "|第3组|40w(20)+3w(21)|5w(20)| 2w(21)| \n",
    "|第4组|50w(20)|1.5w(21)| 3.5w(21)| \n",
    "|第5组|50w(20)+2w(21)|0.5w(21)| 2.5w(21)| "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 二.理解数据\n",
    "这部分主要是对数据有个大致的了解，这个过程主要关注一下下面几点：     \n",
    "\n",
    ">1）基本信息：字段含义，数据规模，字段类型，字段取值范围，饱和度，这部分信息可以帮组我们直接去掉某些特征，空值填充，更具字段含义加工其他特征等等....    \n",
    "\n",
    ">2）特征X目标的交叉信息(单因子分析)：主要了解不同特征数据分布中目标的统计量（比如均值）   \n",
    "\n",
    ">>2.1）这部分可以帮助我们进一步了解特征，比如在某某特征在某某范围内，目标的取值更加具有倾向性，比如查勘定损金额(特征)在5000元以上时，欺诈率(目标)为10%，而金额在5000元以下时，欺诈率为1%，说明\"查勘定损金额>=5000元\"是个有用的显著特征，可以帮助我们单独构建一个特征；  \n",
    "\n",
    ">>2.2）发觉一些异常的特征，比如车牌号为\"*\"的取值，对应的赔付额远高于均值；  \n",
    "\n",
    ">>2.3）相关性分析：与y目标相关性太高的因子，需要重视\n",
    "\n",
    ">3）训练集和验证集的分布差异，这里可以使用PSI指标，定义如下：   \n",
    "\n",
    "$$\n",
    "PSI(p,q)=KL(p||q)+KL(q||p)\n",
    "$$   \n",
    "> KL的定义如下： \n",
    "$$\n",
    "KL(p||q)=\\sum_{i}p_i(log\\frac{p_i}{q_i})\n",
    "$$  \n",
    "\n",
    "### 三.代码演示"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import toad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "diretory=\"xxx\"\n",
    "total_df = pd.read_csv(diretory+'xxx.csv', low_memory=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.1构建y标签：是否出险"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "claim = pd.read_csv(diretory+'/data/claimnum.csv')\n",
    "claim = claim[['polnum_com', 'num']]\n",
    "claim = claim.rename(columns={'polnum_com': 'policy_no'})\n",
    "total_df=pd.merge(total_df,claim,on=\"policy_no\",how=\"inner\")\n",
    "total_df['y'] = np.where(total_df['num']>0, 1, 0)\n",
    "total_df['y'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "total_df=total_df.drop(\"num\",axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.2删掉高基类数据\n",
    "主要是一些id类和时间类数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "total_data_info_df=toad.detector.detect(total_df).reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_data_info_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_data_info_df[(total_data_info_df['unique']>200) & (total_data_info_df['type']=='object')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "drop_cols=total_data_info_df[(total_data_info_df['unique']>200) & (total_data_info_df['type']=='object')]['index'].tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "keep_cols=[\"xxx\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "set(drop_cols)-set(keep_cols)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "total_df=total_df.drop(set(drop_cols)-set(keep_cols),axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def transform_float(x):\n",
    "    try:\n",
    "        rst=float(x)\n",
    "    except:\n",
    "        rst=-1\n",
    "    return rst\n",
    "keep_cols.remove(\"effdate\")\n",
    "for col in keep_cols:\n",
    "    print(col)\n",
    "    total_df[col]=total_df[col].apply(transform_float)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3 去掉unique=1的因子"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "keep_cols=total_data_info_df[total_data_info_df['unique']>1]['index'].tolist()\n",
    "keep_cols=list(set(total_df.columns) & set(keep_cols))\n",
    "total_df=total_df[keep_cols]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.4 去掉missing>90的数据\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "total_data_info_df['missing']=total_data_info_df['missing'].apply(lambda x:float(x[:3]))\n",
    "keep_cols=total_data_info_df[total_data_info_df['missing']<85.0]['index'].tolist()\n",
    "total_df=total_df[list(set(keep_cols) & set(total_df.columns))]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.5单因子分析"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfs=[]\n",
    "max_bins=10\n",
    "cols=total_df.columns.tolist()\n",
    "cols.remove('y')\n",
    "cols.remove(\"effdate\")\n",
    "for col in cols:\n",
    "    print(col)\n",
    "    if total_df[col].dtype=='float':\n",
    "        x_df=total_df[total_df[col]>0][[col,'y']]\n",
    "        thresholds=[]\n",
    "        range_list=sorted(x_df[col].values)\n",
    "        range_set_num=len(set(range_list))\n",
    "        bins=min(max_bins,range_set_num)\n",
    "        if bins==1:\n",
    "            print('continue')\n",
    "            continue\n",
    "        for percent in range(0,bins):\n",
    "            percent_value = np.round(np.percentile(range_list, (1.0 * percent / (bins-1)) * 100.0),2)\n",
    "            thresholds.append(percent_value)\n",
    "        thresholds=sorted(list(set(thresholds)))\n",
    "        print(thresholds)\n",
    "        x_df[col]=np.digitize(x_df[col], thresholds)\n",
    "        x_df['value']=1\n",
    "        x_df=x_df.groupby([col,'y']).agg(['count']).reset_index()\n",
    "        x_df.columns=['col','y','count']\n",
    "        x_df=x_df.pivot(index='col', columns='y', values='count').reset_index()\n",
    "        if x_df.shape[1]==3:\n",
    "            x_df.columns=['col','neg','pos']\n",
    "        else:\n",
    "            if 0 in x_df.columns:\n",
    "                x_df['neg']=x_df[0]\n",
    "                del x_df[0]\n",
    "                x_df['pos']=0\n",
    "            else:#1\n",
    "                x_df['pos']=x_df[1]\n",
    "                del x_df[1]\n",
    "                x_df['neg']=0\n",
    "            x_df=x_df[['col','neg','pos']]\n",
    "        x_df['col']=x_df['col'].apply(lambda x:\"[\"+str(thresholds[x-1])+\",\"+str(thresholds[min(x,len(thresholds)-1)])+\")\")\n",
    "        x_df['pos_by_neg_rate']=x_df['pos']/(x_df['neg']+1e-7)\n",
    "        x_df['base_rate']=x_df['pos'].sum()/(x_df['neg'].sum()+1e-7)\n",
    "        x_df['over_rate']=x_df['pos_by_neg_rate']/x_df['base_rate']\n",
    "        x_df['index']=col\n",
    "        x_df=x_df[['index','col','pos','neg','pos_by_neg_rate','base_rate','over_rate']]\n",
    "        dfs.append(x_df)\n",
    "#         print(x_df)\n",
    "    elif total_df[col].dtype=='object':\n",
    "        x_df=total_df[total_df[col].apply(lambda x:x is not None)][[col,'y']]\n",
    "        x_df['value']=1\n",
    "        x_df=x_df.groupby([col,'y']).agg(['count']).reset_index()\n",
    "        x_df.columns=['col','y','count']\n",
    "        x_df=x_df.pivot(index='col', columns='y', values='count').reset_index()\n",
    "        if x_df.shape[1]==3:\n",
    "            x_df.columns=['col','neg','pos']\n",
    "        else:\n",
    "            if 0 in x_df.columns:\n",
    "                x_df['neg']=x_df[0]\n",
    "                del x_df[0]\n",
    "                x_df['pos']=0\n",
    "            else:\n",
    "                x_df['pos']=x_df[1]\n",
    "                del x_df[1]\n",
    "                x_df['neg']=0\n",
    "            x_df=x_df[['col','neg','pos']]\n",
    "        x_df['pos_by_neg_rate']=x_df['pos']/(x_df['neg']+1e-7)\n",
    "        x_df['base_rate']=x_df['pos'].sum()/(x_df['neg'].sum()+1e-7)\n",
    "        x_df['over_rate']=x_df['pos_by_neg_rate']/x_df['base_rate']\n",
    "        x_df['index']=col\n",
    "        x_df=x_df[['index','col','pos','neg','pos_by_neg_rate','base_rate','over_rate']]\n",
    "        dfs.append(x_df)\n",
    "#         print(x_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#保存结果到excel\n",
    "single_factor_analysis=pd.concat(dfs)\n",
    "single_factor_analysis.columns=['col_name']+single_factor_analysis.columns.tolist()[1:]\n",
    "single_factor_analysis.to_excel(\"./data/单因子分析结果.xlsx\",index=False,header=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.6相关性分析\n",
    "与y相关性太高的因子，可能存在数据泄露问题，需要去掉"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import category_encoders as ce\n",
    "object_cols=total_df.dtypes[total_df.dtypes==object].reset_index()['index'].tolist()\n",
    "total_df[object_cols]=total_df[object_cols].fillna(\"missing\")\n",
    "le=ce.OrdinalEncoder()\n",
    "le.fit(total_df,cols=object_cols)\n",
    "total_df2=le.transform(total_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "total_df2.fillna(-1,inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "corr_df=total_df2.corr()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "corr_df['y'].abs().sort_values(ascending=False)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里面，xxx,xxx字段为赔付额，出险案例很大概率会涉及赔付，这里应该去掉..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.7 训练集/验证集因子分布差异\n",
    "可以使用PSI来评估"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "trn_df=total_df[total_df['effdate'].apply(lambda x:x[:7]<=\"2019-10\")]\n",
    "\n",
    "dev_test_df=total_df[total_df['effdate'].apply(lambda x:x[:7]>=\"2019-11\")]\n",
    "indice=list(range(0,dev_test_df.shape[0]))\n",
    "np.random.shuffle(indice)\n",
    "dev_test_df=dev_test_df.iloc[indice]\n",
    "dev_df=dev_test_df.iloc[:dev_test_df.shape[0]//2]\n",
    "test_df=dev_test_df.iloc[dev_test_df.shape[0]//2:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "trn_df.shape,dev_df.shape,test_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "psi=toad.metrics.PSI(trn_df,dev_df).reset_index()\n",
    "psi.columns=['index','psi_value']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "psi.sort_values(\"psi_value\",ascending=False)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1）psi取值较高的特征可以先不用去掉，后面迭代分析建模时，如果模型的 1）方差 2）或者分布差异影响到了模型效果时再去掉   \n",
    "\n",
    "2）另外，psi取值为0，可能是由于前后完全无法匹配引起的，需要去掉，比如训练集中某特征的取值构成的集合为{a,b,c}，而验证集中取值为{e,f,g}，根据PSI的定义，计算结果会为0   \n",
    "\n",
    "3）这里计算分布差异，也可以才用单因子分析的方法，对训练集和验证集分别做一次单因子分析，然后left join，再继续分析，不过注意数据分箱要才用相同的阈值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
